Abstract

Two ubiquitous aspects of large-scale data analysis are that the data often have heavy- 
tailed properties and that diffusion-based or spectral-based methods are often used to iden- 
tify and extract structure of interest. Perhaps surprisingly, popular distribution-independent 
methods such as those based on the VC dimension fail to provide nontrivial results for even 
simple learning problems such as binary classification in these two settings. In this paper, 
we develop distribution-dependent learning methods that can be used to provide dimension- 
independent sample complexity bounds for the binary classification problem in these two 
popular settings. In particular, we provide bounds on the sample complexity of maximum 
margin classifiers when the magnitude of the entries in the feature vector decays according 
to a power law and also when learning is performed with the so-called Diffusion Maps ker- 
nel. Both of these results rely on bounding the annealed entropy of gap-tolerant classifiers 
in a Hilbert space. We provide such a bound, and we demonstrate that our proof technique 
generalizes to the case when the margin is measured with respect to more general Banach 
space norms. The latter result is of potential interest in cases where modeling the relationship 
between data elements as a dot product in a Hilbert space is too restrictive. 

1 Introduction 

Two ubiquitous aspects of large-scale data analysis are that the data often have heavy-tailed prop- 
erties and that diffusion-based or spectral-based methods are often used to identify and extract 
structure of interest. In the absence of strong assumptions on the data, popular distribution- 
independent methods such as those based on the VC dimension fail to provide nontrivial results 
for even simple learning problems such as binary classification in these two settings. At root, the 
reason is that in both of these situations the data are formally very high dimensional and that 
(without additional regularity assumptions on the data) there may be a small number of "very 
outlying" data points. In this paper, we develop distribution-dependent learning methods that 
can be used to provide dimension-independent sample complexity bounds for the maximum mar- 
gin version of the binary classification problem in these two popular settings. In both cases, we are 
able to obtain nearly optimal linear classification hyperplanes since the distribution-dependent 
tools we employ are able to control the aggregate effect of the "outlying" data points. In partic- 
ular, our results will hold even though the data may be infinite-dimensional and unbounded. 



1.1 Overview of the problems 

Spectral-based kernels have received a great deal of attention recently in machine learning for 
data classification, regression, and exploratory data analysis via dimensionality reduction [25].
Consider, for example, Laplacian Eigenmaps [2] and the related Diffusion Maps [6]. Given a 
graph $G = (V, E)$ (where this graph could be constructed from the data represented as feature
vectors, as is common in machine learning, or it could simply be a natural representation of
a large social or information network, as is more common in other areas of data analysis), let
$f_0, f_1, \ldots, f_n$ be the eigenfunctions of the normalized Laplacian of $G$ and let
$\lambda_0, \lambda_1, \ldots, \lambda_n$ be the corresponding eigenvalues. Then, the Diffusion Map is the following feature map

$$\Phi : v \mapsto \left(\lambda_0^k f_0(v), \ldots, \lambda_n^k f_n(v)\right),$$

and Laplacian Eigenmaps is the special case when $k = 0$. In this case, the support of the data
distribution is unbounded as the size of the graph increases; the VC dimension of hyperplane
classifiers is $\Theta(n)$; and thus existing results do not give dimension-independent sample complexity
bounds for classification by Empirical Risk Minimization (ERM). Moreover, it is possible (and 
indeed quite common in certain applications) that on some vertices v the eigenfunctions fluctuate 
wildly; even on special classes of graphs, such as random graphs $G(n,p)$, a non-trivial uniform
upper bound stronger than $O(n)$ on $\|\Phi(v)\|$ over all vertices $v$ does not appear to be known.^1
Even for maximum margin or so-called "gap-tolerant" classifiers, defined precisely in Section 2
and which are easier to learn than ordinary linear hyperplane classifiers, the existing bounds of
Vapnik are not independent of the number $n$ of nodes.^2
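
As a rough illustration of the feature map discussed above, the following Python sketch computes Diffusion Map features $v \mapsto (\lambda_i^k f_i(v))_i$ for a small graph. It is a minimal sketch under stated assumptions, not the construction used in the proofs of this paper: the function name `diffusion_map`, the toy adjacency matrix `A_toy`, and the choice to take the $\lambda_i$ from the symmetric normalized adjacency matrix $D^{-1/2} A D^{-1/2}$ (which has the same spectrum as the random-walk Markov chain $D^{-1} A$) are all assumptions made here. Eigenvectors are rescaled to have unit norm with respect to the uniform measure on the vertices.

```python
import numpy as np

def diffusion_map(adjacency, k):
    """Return (Phi, lam): row v of Phi is (lam_i**k * f_i(v))_i.

    Assumptions (illustrative only): the graph is undirected with no isolated
    vertices, lam are eigenvalues of D^{-1/2} A D^{-1/2}, and k is a
    non-negative integer.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized adjacency; it shares eigenvectors with the
    # normalized Laplacian I - S.
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(S)          # columns of U are orthonormal
    F = np.sqrt(n) * U                  # unit norm w.r.t. the uniform measure
    return F * lam[None, :] ** k, lam

# Hypothetical toy graph: a triangle with one pendant vertex.
A_toy = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])

Phi, lam = diffusion_map(A_toy, k=2)
mean_sq_norm = np.mean(np.sum(Phi**2, axis=1))   # average of ||Phi(v)||^2 over v
max_sq_norm = np.max(np.sum(Phi**2, axis=1))     # worst single vertex
```

With this normalization, the average of $\|\Phi(v)\|^2$ over the vertices equals $\sum_i \lambda_i^{2k}$, while an individual row can be larger than the average. Control of the average rather than of the worst vertex is the kind of aggregate information that distribution-dependent bounds can exploit.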

A similar problem arises in the seemingly very-different situation that the data exhibit heavy- 
tailed or power-law behavior. Heavy-tailed distributions are probability distributions with tails 
that are not exponentially bounded [24, 5]. Such distributions can arise via several mechanisms,
and they are ubiquitous in applications [5]. For example, graphs in which the degree sequence 
decays according to a power law have received a great deal of attention recently. Relatedly, 
such diverse phenomena as the distribution of packet transmission rates over the internet, the
frequency of word use in common text, the populations of cities, the intensities of earthquakes, 
and the sizes of power outages all have heavy-tailed behavior. Although it is common to normalize 
or preprocess the data to remove the extreme variability in order to apply common data analysis 
and machine learning algorithms, such extreme variability is a fundamental property of the data 
in many of these application domains. 

There are a number of ways to formalize the notion of heavy-tailed behavior for the classifica- 
tion problems we will consider, and in this paper we will consider the case where the magnitude 
of the entries decays according to a power law. (Note, though, that in Appendix A we will, for
completeness, consider the case in which the probability that an entry is nonzero decays in a 
heavy-tailed manner.) That is, if 

$$\Phi(v) = (\phi_1(v), \ldots, \phi_n(v))$$

represents the feature map, then $|\phi_i(v)| < C i^{-\alpha}$ for some absolute constant $C > 0$, with $\alpha > 1$. As
in the case with spectral kernels, in this heavy-tailed situation, the support of the data distribution 
is unbounded as the size of the graph increases, and the VC dimension of hyperplane classifiers
is $\Theta(n)$. Moreover, although there are a small number of "most important" features, they do not
"capture" most of the "information" of the data. Thus, when calculating the sample complexity
for a classification task for data in which the feature vector has heavy-tailed properties, bounds
that do not take into account the distribution are likely to be very weak.

^1 It should be noted that, while potentially problematic for what we are discussing in this paper, such eigenvector
localization often has a natural interpretation in terms of the processes generating the data and can be useful in
many data analysis applications. For example, it might correspond to a high degree node or an articulation point
between two clusters in a large informatics graph [17, 16, 18]; or it might correspond to DNA single-nucleotide
polymorphisms that are particularly discriminative in simple models that are chosen for computational rather than
statistical reasons [19, 23].

^2 VC theory provides an upper bound of $O((n/\Delta)^2)$ on the VC dimension of gap-tolerant classifiers applied to
the Diffusion Map feature space corresponding to a graph with $n$ nodes. (Recall that by Lemma 2 below, the VC
dimension of the space of gap-tolerant classifiers corresponding to a margin $\Delta$, applied to a ball of radius $R$, is
$\sim (R/\Delta)^2$.) Of course, although this bound is quadratic in the number of nodes, VC theory for ordinary linear
classifiers gives an $O(n)$ bound.

In this paper, we develop distribution-dependent bounds for problems in these two settings. 
Clearly, these results are of interest since VC-based arguments fail to provide nontrivial bounds 
in these two settings, in spite of the ubiquity of data with heavy-tailed properties and the widespread
use of spectral-based kernels in many applications. More generally, however, these results are of 
interest since the distribution-dependent bounds underlying them provide insight into how better 
to deal with heterogeneous data with more realistic noise properties. 

1.2 Summary of our main results 

Our first main result provides bounds on classifying data whose magnitude decays in a heavy-tailed 
manner. In particular, in the following theorem we show that if the weight of the $i$-th coordinate
of a random data point is less than $C i^{-\alpha}$ for some $C > 0$, $\alpha > 1$, then the number of samples
needed before a maximum-margin classifier is approximately optimal with high probability is 
independent of the number of features. 

Theorem 1 (Heavy-Tailed Data) Let the data be heavy-tailed in that the feature vector

$$\Phi : v \mapsto (\phi_1(v), \ldots, \phi_n(v))$$

satisfies $|\phi_i(v)| < C i^{-\alpha}$ for some absolute constant $C > 0$, with $\alpha > 1$. Let $\zeta(\cdot)$ denote the Riemann
zeta function. Then, for any $\ell$, if a maximum margin classifier has a margin $> \Delta$, with probability
more than $1 - \delta$, its risk is less than



where $\tilde{O}$ hides multiplicative polylogarithmic factors.

This result follows from a bound on the annealed entropy of gap-tolerant classifiers in a Hilbert 
space that is of independent interest. In addition, it makes important use of the fact that although 
individual elements of the heavy-tailed feature vector may be large, the vector has bounded 
moments. 
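
The bounded-moment observation can be checked numerically. If every coordinate obeys a power-law envelope $C i^{-\alpha}$, then $\|\Phi(v)\|^2$ is at most $C^2 \sum_{i=1}^{n} i^{-2\alpha}$, which stays below $C^2 \zeta(2\alpha)$ no matter how large $n$ is. The sketch below is only an illustration: the helper name `squared_norm_bound` and the parameter values ($C = 1$, $\alpha = 2$) are assumptions chosen so that $\zeta(4) = \pi^4/90$ is available in closed form.

```python
import math

def squared_norm_bound(C, alpha, n):
    """Upper bound on ||Phi(v)||^2 when |phi_i(v)| is bounded by C * i**(-alpha)."""
    return C**2 * sum(i ** (-2.0 * alpha) for i in range(1, n + 1))

# zeta(2 * alpha) for the illustrative choice alpha = 2.
ZETA_4 = math.pi**4 / 90

# The bound grows with the number of features n but never reaches zeta(4).
bounds = [squared_norm_bound(C=1.0, alpha=2.0, n=n) for n in (10, 100, 1000)]
for n, b in zip((10, 100, 1000), bounds):
    print(n, b, ZETA_4 - b)
```

The gap to $\zeta(4)$ shrinks as $n$ grows, so the number of features does not enter the norm bound.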

Our second main result provides bounds on classifying data with spectral kernels. In par- 
ticular, in the following theorem we give dimension-independent upper bounds on the sample 
complexity of learning a nearly-optimal maximum margin classifier in the feature space of the 
Diffusion Maps. 

Theorem 2 (Spectral Kernels) Let the following Diffusion map be given:

$$\Phi : v \mapsto \left(\lambda_1^k f_1(v), \ldots, \lambda_n^k f_n(v)\right),$$

where $f_i$ are normalized eigenfunctions (whose $\ell_2(\mu)$ norm is 1, $\mu$ being the uniform distribution),
$\lambda_i$ are the eigenvalues of the corresponding Markov Chain and $k > 0$. Then, for any $\ell$, if a
maximum margin classifier has a margin $> \Delta$, with probability more than $1 - \delta$, its risk is less
than



where $\tilde{O}$ hides multiplicative polylogarithmic factors.
As with the proof of our main heavy-tailed learning result, the proof of our main spectral learning
result makes essential use of an upper bound on the annealed entropy of gap-tolerant classifiers. 
In applying it, we make important use of the fact that although individual elements of the feature 
vector may fluctuate wildly, the norm of the Diffusion Map feature vector is bounded. 

As a side remark, note that we are not viewing the feature map in Theorem 2 as necessarily
being either a random variable or requiring knowledge of some marginal distribution — as might be 
the case if one is generating points in some space according to some distribution; then constructing 
a graph based on nearest neighbors; and then doing diffusions to construct a feature map. Instead, 
we are thinking of a data graph in which the data are adversarially presented, e.g., a given social 
network is presented, and diffusions and/or a feature map is then constructed. 

These two theorems provide a dimension-independent (i.e., independent of the size $n$ of the
graph and the dimension of the feature space) upper bound on the number of samples needed
to learn a maximum margin classifier, under the assumption that a heavy-tailed feature map or
the Diffusion Map kernel of some scale is used as the feature map. As mentioned, both proofs
(described below in Sections 3.3 and 3.4) proceed by providing a dimension-independent upper
bound on the annealed entropy of gap-tolerant classifiers in the relevant feature space, and then
appealing to Theorem 5 (in Section 2) relating the annealed entropy to the generalization error.
For this bound on the annealed entropy of these gap-tolerant classifiers, we crucially use the fact
that $\mathbb{E}_v\|\Phi(v)\|^2$ is bounded, even if $\sup_v \|\Phi(v)\|$ is unbounded as $n \to \infty$. That is, although
bounds on the individual entries of the feature map do not appear to be known, we crucially use 
that there exist nontrivial bounds on the magnitude of the feature vectors. Since this bound is 
of more general interest, we describe it separately. 

1.3 Summary of our main technical contribution 

The distribution-dependent ideas that underlie our two main results (in Theorems 1 and 2) can
also be used to bound the sample complexity of a classification task more generally under the 
assumption that the expected value of a norm of the data is bounded, i.e., when the magnitude 
of the feature vector of the data in some norm has a finite moment. In more detail: 

• Let $\mathcal{P}$ be a probability measure on a Hilbert space $\mathcal{H}$ and let $\Delta > 0$. In Theorem 6 (in
Section 3.1), we prove that if $\mathbb{E}_{\mathcal{P}}\|x\|^2 = r^2 < \infty$, then the annealed entropy of gap-tolerant
classifiers (defined in Section 2) in $\mathcal{H}$ can be upper bounded in terms of a function of $r$, $\Delta$,
and (the number of samples) $\ell$, independent of the (possibly infinite) dimension of $\mathcal{H}$.

It should be emphasized that the assumption that the expectation of some moment of the norm 
of the feature vector is bounded is a much weaker condition than the more common assumption 
that the largest element is bounded, and thus this result is likely of more general interest in 
dealing with heterogeneous and noisy data. For example, similar ideas have been applied recently 
to the problem of bounding the sample complexity of learning smooth cuts on a low-dimensional 
manifold [22].

To establish this result, we use a result (see Lemma 2 in Section 3.2) that the VC dimension
of gap-tolerant classifiers in a Hilbert space when the margin is $\Delta$ over a bounded domain such
as a ball of radius $R$ is bounded above by $\lfloor R^2/\Delta^2 \rfloor + 1$. Such bounds on the VC dimension
of gap-tolerant classifiers have been stated previously by Vapnik [27]. However, in the course
of his proof bounding the VC dimension of a gap-tolerant classifier whose margin is $\Delta$ over a
ball of radius $R$ (see [27], page 353), Vapnik states, without further justification, that due to
symmetry the set of points in a ball that is extremal in the sense of being the hardest to shatter
with gap-tolerant classifiers is the regular simplex. Attention has been drawn to this fact by
Burges (see [1], footnote 20), who mentions that a rigorous proof of this fact seems to be absent.
Here, we provide a new proof of the upper bound on the VC dimension of such classifiers without
making this assumption. (See Lemma 2 in Section 3.2 and its proof.) Hush and Scovel [12]
provide an alternate proof of Vapnik's claim; it is somewhat different than ours, and they do not 
extend their proof to Banach spaces. 

The idea underlying our new proof of this result generalizes to the case when the data need 
not have compact support and where the margin may be measured with respect to more general 
norms. In particular, we show that the VC dimension of gap-tolerant classifiers with margin $\Delta$
in a ball of radius $R$ in a Banach space of Rademacher type $p \in (1,2]$ and type constant $T$ is
bounded above by $\sim (3TR/\Delta)^{p/(p-1)}$, and that there exists a Banach space of type $p$ (in fact $\ell_p$) for
which the VC dimension is bounded below by $(R/\Delta)^{p/(p-1)}$. (See Lemmas 4 and 5 in Section 4.2.)
Using this result, we can also prove bounds for the annealed entropy of gap-tolerant classifiers in a
Banach space. (See Theorem 7 in Section 4.3.) In addition to being of interest from a theoretical
perspective, this result is of potential interest in cases where modeling the relationship between 
data elements as a dot product in a Hilbert space is too restrictive, and thus this may be of 
interest, e.g., when the data are extremely sparse and heavy-tailed. 

1.4 Maximum margin classification and ERM with gap-tolerant classifiers 

Gap-tolerant classifiers (see Section 2 for more details) are useful, at least theoretically, as a
means of implementing structural risk minimization (see, e.g., Appendix A.2 of [4]). With
gap-tolerant classifiers, the margin $\Delta$ is fixed beforehand, and does not depend on the data. See,
e.g., [9, 11, 12, 26]. With maximum margin classifiers, on the other hand, the margin is a function
of the data. In spite of this difference, the issues that arise in the analysis of these two classifiers
are similar. For example, through the fat-shattering dimension, bounds can be obtained for the
maximum margin classifier, as shown by Shawe-Taylor et al. [26]. Here, we briefly sketch how
this is achieved. 

Definition 1 Let $\mathcal{F}$ be a set of real valued functions. We say that a set of points $x_1, \ldots, x_s$
is $\gamma$-shattered by $\mathcal{F}$ if there are real numbers $t_1, \ldots, t_s$ such that for all binary vectors
$b = (b_1, \ldots, b_s)$ and each $i \in [s] = \{1, \ldots, s\}$, there is a function $f_b$ satisfying,

$$f_b(x_i) \ge t_i + \gamma \ \text{ if } b_i = 1, \qquad f_b(x_i) \le t_i - \gamma \ \text{ otherwise}.$$

For each $\gamma > 0$, the fat shattering dimension $\mathrm{fat}_{\mathcal{F}}(\gamma)$ of the set $\mathcal{F}$ is defined to be the size of the
largest $\gamma$-shattered set if this is finite; otherwise it is declared to be infinity.

Note that, in this definition, $t_i$ can be different for different $i$, which is not the case in gap-tolerant
classifiers. However, one can incorporate this shift into the feature space by a simple construction. 
We start with the following definition of a Banach space of type p with type constant T. 

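Definition 2 (Banach space, type, and type constant) A Banach space is a complete normed
vector space. A Banach space $B$ is said to have (Rademacher) type $p$ if there exists $T < \infty$ such
that for all $n$ and $x_1, \ldots, x_n \in B$

$$\mathbb{E}_{\epsilon}\left\|\sum_{i=1}^{n} \epsilon_i x_i\right\|^p \le T^p \sum_{i=1}^{n} \|x_i\|^p,$$

where $\epsilon_1, \ldots, \epsilon_n$ are independent random signs, each uniform on $\{-1, +1\}$.
The smallest $T$ for which the above holds with $p$ equal to the type, is called the type constant of $B$.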
Given a Banach space $B$ of type $p$ and type constant $T$, let $B'$ consist of all tuples $(v, c)$ for $v \in B$
and $c \in \mathbb{R}$, with the norm

$$\|(v,c)\|_{B'} := \left(\|v\|^p + |c|^p\right)^{1/p}.$$

Noting that $B$ is a Banach space of type $p$ and type constant $T$ (see Sections 4.1 and 4.2), one
can easily check that $B'$ is a Banach space of type $p$ and type constant $\max(T, 1)$.

In our distribution-specific setting, we cannot control the fat-shattering dimension, but we
can control the logarithm of the expected value of $2^{k\,\mathrm{fat}_{\mathcal{F}}(\gamma)}$ for any constant $k$ by applying
Theorem 7 to $B'$. As seen from Lemma 3.7 and Corollary 3.8 of the journal version of [26], this is
all that is required for obtaining generalization error bounds for maximum margin classification.
In the present context, the logarithm of the expected value of the exponential of the fat shattering
dimension of linear 1-Lipschitz functions on a random data set of size $\ell$ taken i.i.d. from $\mathcal{P}$ on $B$ is
bounded by the annealed entropy of gap-tolerant classifiers on $B'$ with respect to the push-forward
$\mathcal{P}'$ of the measure $\mathcal{P}$ under the inclusion $B \hookrightarrow B'$.

This allows us to state the following theorem, which is an analogue of Theorem 4.17 of the
journal version of [26], adapted using Theorem 7 of this paper.

Theorem 3 Let $\Delta > 0$. Suppose inputs are drawn independently according to a distribution $\mathcal{P}$, where $\mathcal{P}$ is
a probability measure on a Banach space $B$ of type $p$ and type constant $T$, and $\mathbb{E}_{\mathcal{P}}\|x\|^p = r^p < \infty$.
If we succeed in correctly classifying $\ell$ such inputs by a maximum margin hyperplane of margin
$\Delta$, then with confidence $1 - \delta$ the generalization error will be bounded from above by



where $\tilde{O}$ hides multiplicative polylogarithmic factors involving $\ell$, $T$, $r$ and $\Delta$.

Specializing this theorem to a Hilbert space, we have the following theorem as a corollary. 

Theorem 4 Let $\Delta > 0$. Suppose inputs are drawn independently according to a distribution $\mathcal{P}$, where
$\mathcal{P}$ is a probability measure on a Hilbert space $\mathcal{H}$, and $\mathbb{E}_{\mathcal{P}}\|x\|^2 = r^2 < \infty$. If we succeed in correctly
classifying $\ell$ such inputs by a maximum margin hyperplane with margin $\Delta$, then with confidence
$1 - \delta$ the generalization error will be bounded from above by




where $\tilde{O}$ hides multiplicative polylogarithmic factors involving $\ell$, $r$ and $\Delta$.

Note that Theorem 4 is an analogue of Theorem 4.17 of the journal version of [26], but adapted
using Theorem 6 of this paper. In particular, note that this theorem does not assume that the
distribution is contained in a ball of some radius $R$, but instead it assumes only that some
moment of the distribution is bounded. 

1.5 Outline of the paper 

In the next section, Section 2, we review some technical preliminaries that we will use in our
subsequent analysis. Then, in Section 3, we state and prove our main result for gap-tolerant
learning in a Hilbert space, and we show how this result can be used to prove our two main
theorems in maximum margin learning. Then, in Section 4, we state and prove an extension
of our gap-tolerant learning result to the case when the gap is measured with respect to more
general Banach space norms; and then, in Sections 5 and 6, we provide a brief discussion and
conclusion. Finally, for completeness, in Appendix A, we will provide a bound for exact (as
opposed to maximum margin) learning in the case in which the probability that an entry is 
nonzero (as opposed to the value of that entry) decays in a heavy-tailed manner. 

2 Background and preliminaries 

In this paper, we consider the supervised learning problem of binary classification, i.e., we consider
an input space $\mathcal{X}$ (e.g., a Euclidean space or a Hilbert space) and an output space $\mathcal{Y}$, where
$\mathcal{Y} = \{-1, +1\}$, and where the data consist of pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ that are random variables
distributed according to an unknown distribution. We shall assume that for any $X$, there is at
most one pair $(X, Y)$ that is observed. We observe $\ell$ i.i.d. pairs $(X_i, Y_i)$, $i = 1, \ldots, \ell$, sampled
according to this unknown distribution, and the goal is to construct a classification function
$\alpha : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$ with low probability of error.

Whereas an ordinary linear hyperplane classifier consists of an oriented hyperplane, and points
are labeled ±1, depending on which side of the hyperplane they lie, a gap-tolerant classifier
consists of an oriented hyperplane and a margin of thickness $\Delta$ in some norm. Any point outside
the margin is labeled ±1, depending on which side of the hyperplane it falls on, and all points
within the margin are declared "correct," without receiving a ±1 label. This latter setting has
been considered in [27, 11] (as a way of implementing structural risk minimization: apply empirical
risk minimization to a succession of problems, and choose the gap $\Delta$ that gives the minimum
risk bound).

The risk $R(\alpha)$ of a linear hyperplane classifier $\alpha$ is the probability that $\alpha$ misclassifies a
random data point $(x, y)$ drawn from $\mathcal{P}$; more formally, $R(\alpha) := \mathbb{E}_{\mathcal{P}}\,\mathbb{I}[\alpha(x) \neq y]$. Given a set
of $\ell$ labeled data points $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, the empirical risk $R_{\mathrm{emp}}(\alpha, \ell)$ of a linear hyperplane
classifier $\alpha$ is the frequency of misclassification on the empirical data; more formally,

$$R_{\mathrm{emp}}(\alpha, \ell) := \frac{1}{\ell} \sum_{i=1}^{\ell} \mathbb{I}[\alpha(x_i) \neq y_i],$$

where $\mathbb{I}[\cdot]$ denotes the indicator of the respective event. The risk and empirical
risk for gap-tolerant classifiers are defined in the same manner. Note, in particular, that data
points labeled as "correct" do not contribute to the risk for a gap-tolerant classifier, i.e., data
points that are on the "wrong" side of the hyperplane but that are within the $\Delta$ margin are not
considered as incorrect and do not contribute to the risk.
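
To make the difference between the two notions of empirical risk concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `gap_tolerant_empirical_risk` and the parametrization are assumptions. In particular, the sketch assumes the classifier is given by a normal vector `w`, an offset `b`, and a margin parameter `delta`, and that a point is "inside the margin" when its Euclidean distance to the hyperplane is smaller than `delta`; other conventions for the thickness of the margin differ only by a constant factor.

```python
import numpy as np

def gap_tolerant_empirical_risk(X, y, w, b, delta):
    """Fraction of points that lie outside the margin AND on the wrong side.

    Points closer than `delta` to the hyperplane {x : <w, x> = b} are treated
    as "correct" and never counted as errors (assumed convention).
    Setting delta = 0 recovers the ordinary hyperplane classifier.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = np.asarray(w, dtype=float)
    signed_dist = (X @ w - b) / np.linalg.norm(w)   # signed distance to hyperplane
    outside = np.abs(signed_dist) >= delta          # not absorbed by the margin
    predicted = np.where(signed_dist >= 0, 1, -1)   # side of the hyperplane
    errors = outside & (predicted != y)
    return errors.mean()

# Hypothetical data: the third point is mislabeled but sits inside the margin.
X_demo = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0], [-3.0, 0.0]]
y_demo = [1, -1, -1, 1]
risk_gap = gap_tolerant_empirical_risk(X_demo, y_demo, w=[1.0, 0.0], b=0.0, delta=1.0)
risk_plain = gap_tolerant_empirical_risk(X_demo, y_demo, w=[1.0, 0.0], b=0.0, delta=0.0)
```

In this toy example the gap-tolerant risk is smaller than the ordinary one, because the mislabeled point inside the margin is not counted.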

In classification, the ultimate goal is to find a classifier that minimizes the true risk, i.e.,
$\arg\min_{\alpha \in \Lambda} R(\alpha)$, but since the true risk $R(\alpha)$ of a classifier $\alpha$ is unknown, an empirical surrogate
is often used. In particular, Empirical Risk Minimization (ERM) is the procedure of choosing
a classifier $\alpha$ from a set of classifiers $\Lambda$ by minimizing the empirical risk, $\arg\min_{\alpha \in \Lambda} R_{\mathrm{emp}}(\alpha, \ell)$.
The consistency and rate of convergence of ERM (see [27] for precise definitions) can be related
to uniform bounds on the difference between the empirical risk and the true risk over all $\alpha \in \Lambda$.
There is a large body of literature on sufficient conditions for this kind of uniform convergence.
For instance, the VC dimension is commonly used toward this end. Note that, when considering
gap-tolerant classifiers, there is an additional caveat, as one obtains uniform bounds only over
those gap-tolerant classifiers that do not contain any data points in the margin; Appendix
A.2 of [4] addresses this issue.

In this paper, our main emphasis is on the annealed entropy: 

Definition 3 (Annealed Entropy) Let $\mathcal{P}$ be a probability measure supported on a vector space
$\mathcal{H}$. Given a set $\Lambda$ of decision rules and a set of points $Z = \{z_1, \ldots, z_\ell\} \subset \mathcal{H}$, let $N^\Lambda(z_1, \ldots, z_\ell)$ be
the number of ways of labeling $\{z_1, \ldots, z_\ell\}$ into positive and negative samples such that there exists
a gap-tolerant classifier that predicts incorrectly the label of each $z_i$. Given the above notation,

$$H^\Lambda_{\mathrm{ann}}(k) := \ln \mathbb{E}_{\mathcal{P}^{\times k}} N^\Lambda(z_1, \ldots, z_k)$$

is the annealed entropy of the classifier $\Lambda$ with respect to $\mathcal{P}$.

Note that although we have defined the annealed entropy for general decision rules, below we will 
consider the case that A consists of linear decision rules. 

As the following theorem states, the annealed entropy of a classifier can be used to get an 
upper bound on the generalization error. This follows from Theorem 8 in [3] and a remark on page 
198 of [8]. Note that the class $\Lambda^*$ is itself random, and consequently, $\sup_{\alpha \in \Lambda^*} R(\alpha) - R_{\mathrm{emp}}(\alpha, \ell)$
is the supremum over a random class. 

Theorem 5 Let $\Lambda^*$ be the family of all gap-tolerant classifiers such that no data point lies inside
the margin. Then, 




holds true, for any number of samples $\ell$ and for any error parameter $\epsilon$.

The key property of the function class that leads to uniform bounds is the sublinearity of the 
logarithm of the expected value of the "growth function," which measures the number of distinct 
ways in which a data set of a particular size can be split by the function class. A finite VC bound 
guarantees this in a distribution-free setting. The annealed entropy is a distribution-specific 
measure, i.e., the same family of classifiers can have different annealed entropies when measured 
with respect to different distributions. For a more detailed exposition of uniform bounds in the 
context of gap-tolerant classifiers, we refer the reader to ([3], Appendix A.2).

Note also that normed vector spaces (such as Hilbert spaces and Banach spaces) are relevant 
to learning theory for the following reason. Data are often accompanied with an underlying 
metric which carries information about how likely it is that two data points have the same 
label. This makes concrete the intuition that points with the same class label are clustered 
together. Many algorithms cannot be implemented over an arbitrary metric space, but require 
a linear structure. If the original metric space does not have such a structure, as is the case 
when classifying for example, biological data or decision trees, it is customary to construct a 
feature space representation, which embeds data into a vector space. We will be interested in the 
commonly used Hilbert spaces, in which distances in the feature space are measured with respect
to the $\ell_2$ distance (as well as more general Banach spaces, in Section 4).

Finally, note that our results where the margin is measured in $\ell_2$ can be transferred to a
setting with kernels. Given a kernel $k(\cdot,\cdot)$, it is well known that linear classification using a
kernel $k(\cdot,\cdot)$ is equivalent to mapping $x$ onto the functional $k(x,\cdot)$ and then finding a separating
halfspace in the Reproducing Kernel Hilbert Space (RKHS), which is the Hilbert Space generated
by the functionals of the form $k(x,\cdot)$. Since the span of any finite set of points in a Hilbert Space
can be isometrically embedded in $\ell_2$, our results hold in the setting of kernel-based learning as
well, when one first uses the feature map $x \mapsto k(x,\cdot)$ and works in the RKHS.

3 Gap-tolerant classifiers in Hilbert spaces 

In this section, we state and prove Theorem 6, our main result regarding an upper bound for
the annealed entropy of gap-tolerant classifiers in $\ell_2$. This result is of independent interest, and
it was used in a crucial way in the proof of Theorems 1 and 2. We start in Section 3.1 with
the statement and proof of Theorem 6, and then in Section 3.2 we bound the VC dimension of
gap-tolerant classifiers over a ball of radius $R$. Then, in Section 3.3 we apply these results to
prove our main theorem on learning with heavy-tailed data, and finally in Section 3.4 we apply
these results to prove our main theorem on learning with spectral kernels.



3.1 Bound on the annealed entropy of gap-tolerant classifiers in Hilbert spaces 

The following theorem is our main result regarding an upper bound for the annealed entropy of 
gap-tolerant classifiers. The result holds for gap-tolerant classification in a Hilbert space, i.e., 
when the distances in the feature space are measured with respect to the $\ell_2$ norm. Analogous
results hold when distances are measured more generally, as we will describe in Section 4.

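Theorem 6 (Annealed entropy; Upper bound; Hilbert Space) Let $\mathcal{P}$ be a probability measure
on a Hilbert space $\mathcal{H}$, and let $\Delta > 0$. If $\mathbb{E}_{\mathcal{P}}\|x\|^2 = r^2 < \infty$, then the annealed entropy
of gap-tolerant classifiers in $\mathcal{H}$, where the gap is $\Delta$, is

$$H^\Lambda_{\mathrm{ann}}(\ell) \le \left(\frac{\ell^{1/2} r}{\Delta} + 1\right)\left(1 + \ln(\ell + 1)\right).$$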



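Proof: Let $\ell$ independent, identically distributed (i.i.d.) samples $z_1, \ldots, z_\ell$ be chosen from $\mathcal{P}$. We
partition them into two classes:

$$X = \{x_1, \ldots, x_{\ell-k}\} := \{z_i \mid \|z_i\| > R\},$$

and

$$Y = \{y_1, \ldots, y_k\} := \{z_i \mid \|z_i\| \le R\}.$$

Our objective is to bound from above the annealed entropy $H^\Lambda_{\mathrm{ann}}(\ell) = \ln \mathbb{E}[N^\Lambda(z_1, \ldots, z_\ell)]$. By
Lemma 1, $N^\Lambda$ is sub-multiplicative. Therefore,

$$N^\Lambda(z_1, \ldots, z_\ell) \le N^\Lambda(x_1, \ldots, x_{\ell-k})\, N^\Lambda(y_1, \ldots, y_k).$$

Taking an expectation over $\ell$ i.i.d. samples from $\mathcal{P}$,

$$\mathbb{E}[N^\Lambda(z_1, \ldots, z_\ell)] \le \mathbb{E}[N^\Lambda(x_1, \ldots, x_{\ell-k})\, N^\Lambda(y_1, \ldots, y_k)].$$

Now applying Lemma 2, we see that

$$\mathbb{E}[N^\Lambda(z_1, \ldots, z_\ell)] \le \mathbb{E}\left[N^\Lambda(x_1, \ldots, x_{\ell-k})\,(k+1)^{R^2/\Delta^2 + 1}\right].$$

Moving $(k+1)^{R^2/\Delta^2 + 1}$ outside this expression,

$$\mathbb{E}[N^\Lambda(z_1, \ldots, z_\ell)] \le \mathbb{E}[N^\Lambda(x_1, \ldots, x_{\ell-k})]\,(k+1)^{R^2/\Delta^2 + 1}.$$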

Note that $N^\Lambda(x_1, \ldots, x_{\ell-k})$ is always bounded above by $2^{\ell-k}$ and that the random variables
$I[\|z_i\| > R]$ are i.i.d. Let $p = \mathbb{P}[\|z_i\| > R]$, and note that $\ell - k$ is the sum of $\ell$ independent
Bernoulli variables. Moreover, by Markov's inequality,

$$\mathbb{P}[\|z_i\| > R] \le \frac{\mathbb{E}[\|z_i\|^2]}{R^2},$$

and therefore $p \le (r/R)^2$. In addition,

$$\mathbb{E}[N^\Lambda(x_1, \ldots, x_{\ell-k})] \le \mathbb{E}[2^{\ell-k}].$$

Let $I[\cdot]$ denote an indicator variable. $\mathbb{E}[2^{\ell-k}]$ can be written as

$$\prod_{i=1}^{\ell} \mathbb{E}\left[2^{I[\|z_i\| > R]}\right] = (1 + p)^{\ell} \le e^{p\ell}.$$

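Putting everything together, we see that

$$\mathbb{E}[N^\Lambda(z_1, \ldots, z_\ell)] \le \exp\left(\ell \left(\frac{r}{R}\right)^2 + \ln(k+1)\left(\frac{R^2}{\Delta^2} + 1\right)\right).$$

If we substitute $R = (\ell r^2 \Delta^2)^{1/4}$, it follows that

$$H^\Lambda_{\mathrm{ann}}(\ell) = \log \mathbb{E}[N^\Lambda(z_1, \ldots, z_\ell)] \le \left(\frac{\ell^{1/2} r}{\Delta} + 1\right)\left(1 + \ln(\ell + 1)\right).$$

□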
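
The arithmetic behind the final substitution can be sanity-checked numerically. The sketch below is illustrative only; the function names and the sample values of $\ell$, $r$, and $\Delta$ are assumptions. It checks three things: that $R = (\ell r^2 \Delta^2)^{1/4}$ balances the two terms $\ell (r/R)^2$ and $R^2/\Delta^2$; that, after replacing $k$ by its worst case $\ell$, the exponent from the proof differs from the stated bound by exactly 1; and that the expectation of $2^{B}$ for a Binomial($\ell$, $p$) variable $B$ equals $(1+p)^{\ell}$, which is the identity used for $\mathbb{E}[2^{\ell-k}]$.

```python
import math

def balancing_radius(ell, r, delta):
    """Truncation radius R = (ell * r^2 * delta^2)^(1/4) substituted in the proof."""
    return (ell * r**2 * delta**2) ** 0.25

def exponent_before_simplification(ell, r, delta, R):
    """ell*(r/R)^2 + (R^2/delta^2 + 1)*ln(ell + 1), i.e. the exponent with k <= ell."""
    return ell * (r / R) ** 2 + (R**2 / delta**2 + 1.0) * math.log(ell + 1.0)

def annealed_entropy_bound(ell, r, delta):
    """Stated bound: (sqrt(ell)*r/delta + 1) * (1 + ln(ell + 1))."""
    return (math.sqrt(ell) * r / delta + 1.0) * (1.0 + math.log(ell + 1.0))

def expected_two_power(ell, p):
    """E[2^B] for B ~ Binomial(ell, p), summed directly over the pmf."""
    return sum(math.comb(ell, j) * p**j * (1 - p) ** (ell - j) * 2**j
               for j in range(ell + 1))

# Hypothetical values, chosen only for illustration.
ell, r, delta = 10_000, 1.0, 0.5
R = balancing_radius(ell, r, delta)
gap = annealed_entropy_bound(ell, r, delta) - exponent_before_simplification(ell, r, delta, R)

# The bound grows sublinearly in ell, so bound / ell shrinks as ell grows.
ratios = [annealed_entropy_bound(n, r, delta) / n for n in (10**2, 10**4, 10**6)]
```

The sublinear growth in $\ell$ is the property that, via Theorem 5, leads to a nontrivial generalization bound.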

For ease of reference, we note the following easily established fact about $N^\Lambda$. This lemma is
used in the proof of Theorem 6 above and Theorem 7 below.

Lemma 1 Let $\{x_1, \ldots, x_\ell\} \cup \{y_1, \ldots, y_k\}$ be a partition of the data $Z$ into two parts. Then, $N^\Lambda$
is submultiplicative in the following sense:

$$N^\Lambda(x_1, \ldots, x_\ell, y_1, \ldots, y_k) \le N^\Lambda(x_1, \ldots, x_\ell)\, N^\Lambda(y_1, \ldots, y_k).$$

Proof: This holds because any partition of $Z := \{x_1, \ldots, x_\ell, y_1, \ldots, y_k\}$ into two parts by an
element of $\Lambda$ induces such a partition for the sets $\{x_1, \ldots, x_\ell\}$ and $\{y_1, \ldots, y_k\}$, and for any pair
of partitions of $\{x_1, \ldots, x_\ell\}$ and $\{y_1, \ldots, y_k\}$, there is at most one partition of $Z$ that induces
them.

□
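
The counting argument can be seen on a toy class of decision rules. The sketch below is an illustration under assumptions, not the gap-tolerant setting of the lemma: it uses ordinary oriented thresholds on the real line as the class of decision rules, and the function name `count_threshold_labelings` and the sample points are hypothetical. It enumerates every labeling the class can produce on a point set by brute force and compares the count on the whole set with the product of the counts on the two parts.

```python
def count_threshold_labelings(points):
    """Number of distinct +/-1 labelings of `points` by oriented thresholds on the line.

    A rule is a pair (t, sign): x is labeled `sign` if x > t and `-sign` otherwise.
    Assumes the points are distinct real numbers.
    """
    pts = sorted(points)
    # One threshold below all points, one between each neighboring pair, one above all.
    cuts = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    labelings = set()
    for t in cuts:
        for sign in (1, -1):
            labelings.add(tuple(sign if x > t else -sign for x in points))
    return len(labelings)

# Hypothetical split of a five-point data set into two parts.
part_x = [0.1, 0.7, 1.5]
part_y = [-2.0, 3.0]
n_whole = count_threshold_labelings(part_x + part_y)
n_product = count_threshold_labelings(part_x) * count_threshold_labelings(part_y)
```

Each labeling of the whole set restricts to one labeling of each part, and distinct labelings of the whole set give distinct pairs, so the count on the whole set cannot exceed the product.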

3.2 Bound on the VC dimension of gap-tolerant classifiers in Hilbert spaces 

As an intermediate step in the proof of Theorem 6, we needed a bound on the VC dimension of
a gap-tolerant classifier within a ball of fixed radius. Lemma 2 below provides such a bound and
is due to Vapnik [27]. Note, though, that in the course of his proof (see [27], page 353), Vapnik
states, without further justification, that due to symmetry the set of points that is extremal in the
sense of being the hardest to shatter with gap-tolerant classifiers is the regular simplex. Attention
has also been drawn to this fact by Burges ([1], footnote 20), who mentions that a rigorous proof
of this fact seems to be absent. Vapnik's claim has since been proved by Hush and Scovel [12].
Here, we provide a new proof of Lemma 2. It is simpler than previous proofs, and in Section 4
we will see that it generalizes to cases when the margin is measured with norms other than $\ell_2$.

Lemma 2 (VC Dimension; Upper bound; Hilbert Space) In a Hilbert space, the VC dimension
of a gap-tolerant classifier whose margin is $\Delta$ over a ball of radius $R$ can be bounded
above by $\lfloor R^2/\Delta^2 \rfloor + 1$.

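Proof: Suppose the VC dimension is $n$. Then there exists a set of $n$ points $X = \{x_1, \ldots, x_n\}$ in $B(R)$ that can be completely shattered using gap-tolerant classifiers. We will consider two cases, first that $n$ is even, and then that $n$ is odd.

First, assume that $n$ is even, i.e., that $n = 2k$ for some positive integer $k$. We apply the probabilistic method to obtain an upper bound on $n$. Note that for every set $S \subseteq [n]$, the set $X_S := \{x_i \mid i \in S\}$ can be separated from $X - X_S$ using a gap-tolerant classifier. Therefore the distance between the centroids (respective centers of mass) of these two sets is greater than or equal to $2\Delta$. In particular, for each $S$ having $k = n/2$ elements,

$$\left\| \frac{\sum_{i \in S} x_i}{k} - \frac{\sum_{i \notin S} x_i}{k} \right\| \ge 2\Delta.$$

Suppose now that $S$ is chosen uniformly at random from the $\binom{2k}{k}$ sets of size $k$. Then,

$$4\Delta^2 \le \mathbb{E}\left\| \frac{\sum_{i \in S} x_i}{k} - \frac{\sum_{i \notin S} x_i}{k} \right\|^2 = \frac{2}{k(2k-1)} \sum_{i=1}^{2k} \|x_i\|^2 - \frac{1}{k^2(2k-1)} \left\| \sum_{i=1}^{2k} x_i \right\|^2 \le \frac{4R^2}{n-1}.$$

The equality holds because, for $i \ne j$, the points $x_i$ and $x_j$ fall on the same side of the random split with probability $(k-1)/(2k-1)$, so the expected product of their signs is $-1/(2k-1)$. Therefore,

$$\Delta^2 \le \frac{R^2}{n-1},$$

and so

$$n \le \frac{R^2}{\Delta^2} + 1.$$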



Next, assume that $n$ is odd. We perform a similar calculation for $n = 2k + 1$. As before, we average over all sets $S$ of cardinality $k$ the squared distance between the centroid of $X_S$ and the centroid (center of mass) of $X - X_S$. Proceeding as before,

$$4\Delta^2 \le \mathbb{E}\left\| \frac{\sum_{i \in S} x_i}{k} - \frac{\sum_{i \notin S} x_i}{k+1} \right\|^2 \le \frac{4k+3}{2k(2k+1)(k+1)} \left\{ (2k+1) R^2 \right\}.$$

Therefore, $n \le \frac{R^2}{\Delta^2} + 1$.

□
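The averaging step for even $n = 2k$ can be checked exhaustively on small examples. The sketch below enumerates all balanced splits of a point set, averages the squared distance between the two centroids, and compares the result with the closed form $\frac{2}{k(2k-1)}\sum_i \|x_i\|^2 - \frac{1}{k^2(2k-1)}\|\sum_i x_i\|^2$, which is at most $4R^2/(n-1)$ for points in a ball of radius $R$. The closed form is our own expansion of the expectation, and the helper names are illustrative.

```python
import itertools
import random

def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(c * c for c in v)

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    m = len(points)
    return [sum(p[j] for p in points) / m for j in range(len(points[0]))]

def mean_sq_centroid_gap(points):
    """Average, over all splits into two halves, of the squared centroid distance."""
    n = len(points)
    k = n // 2
    gaps = []
    for S in itertools.combinations(range(n), k):
        chosen = set(S)
        inside = [points[i] for i in chosen]
        outside = [points[i] for i in range(n) if i not in chosen]
        diff = [a - b for a, b in zip(centroid(inside), centroid(outside))]
        gaps.append(sq_norm(diff))
    return sum(gaps) / len(gaps)

def closed_form_gap(points):
    """Closed form of the same average for n = 2k points."""
    k = len(points) // 2
    total = [sum(p[j] for p in points) for j in range(len(points[0]))]
    return (2 * sum(sq_norm(p) for p in points) / (k * (2 * k - 1))
            - sq_norm(total) / (k * k * (2 * k - 1)))

def random_points_in_ball(n, dim, radius, seed=0):
    """Rejection-sample n points from the Euclidean ball of the given radius."""
    rng = random.Random(seed)
    pts = []
    while len(pts) < n:
        p = [rng.uniform(-radius, radius) for _ in range(dim)]
        if sq_norm(p) <= radius ** 2:
            pts.append(p)
    return pts
```

Two antipodal points on the sphere of radius $R$ attain the bound $4R^2/(n-1)$ with equality for $n = 2$, which shows that the constant in the even case cannot be improved by this argument.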



3.3 Learning with heavy-tailed data: proof of Theorem 1

Proof: For a random data sample $x$,

$$\mathbb{E}\|x\|^2 \le \sum_{i=1}^{n} \left(C i^{-\alpha}\right)^2 \qquad (3)$$

$$\le C^2 \zeta(2\alpha), \qquad (4)$$

where $\zeta$ is the Riemann zeta function. The theorem then follows from Theorem 6.

□
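The point of the bound above is that it does not depend on the dimension $n$. A minimal Python sketch, assuming coordinates bounded in magnitude by $C i^{-\alpha}$ and taking $\alpha = 1$ so that $\zeta(2\alpha) = \pi^2/6$ is available in closed form, compares the partial sums with the limit. The function name is illustrative.

```python
import math

def squared_norm_envelope(n, C, alpha):
    """Sum over i = 1..n of (C * i^(-alpha))^2, an upper bound on E||x||^2 under the decay assumption."""
    return sum((C * i ** (-alpha)) ** 2 for i in range(1, n + 1))

# Closed form of zeta(2); used as the dimension-free limit when alpha = 1.
ZETA_2 = math.pi ** 2 / 6
```

For every $n$ the partial sum stays below $C^2 \zeta(2)$, so increasing the number of features does not increase the second-moment bound that feeds into the annealed entropy estimate.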

3.4 Learning with spectral kernels: proof of Theorem 2

Proof: A Diffusion Map for the graph $G = (V, E)$ is the feature map that associates with a vertex $x$ the feature vector $x = (\lambda_1^{\alpha} f_1(x), \ldots, \lambda_m^{\alpha} f_m(x))$, when the eigenfunctions corresponding to the top $m$ eigenvalues are chosen. Let $\mu$ be the uniform distribution on $V$ and $|V| = n$. We note that if the $f_j$ are normalized eigenfunctions, i.e., $\forall j$, $\sum_{x \in V} f_j(x)^2 = 1$,

$$\mathbb{E}\|x\|^2 = \frac{\sum_{i=1}^{m} \lambda_i^{2\alpha}}{n} \le \frac{\sum_{i=1}^{m} 1}{n} \le 1. \qquad (5)$$

The above inequality holds because the eigenvalues have magnitudes that are less than or equal to 1:

$$1 = \lambda_1 \ge \cdots \ge \lambda_n \ge -1.$$

The theorem then follows from Theorem 6.

□
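The second-moment computation for the Diffusion Map can be checked on a graph whose spectrum is known in closed form. The sketch below uses the random walk on an odd cycle, whose eigenvalues are $\cos(2\pi j/n)$ with Fourier eigenvectors, and averages $\|x\|^2$ over vertices for a feature map built from the top $m$ eigenpairs. The choice of graph, the use of $|\lambda|^{\alpha}$ to handle negative eigenvalues, and all names are assumptions made for illustration.

```python
import math

def cycle_eigenpairs(n):
    """Eigenvalues and unit-norm eigenvectors of the random walk on an odd cycle with n vertices."""
    assert n % 2 == 1, "odd n avoids the special eigenvalue -1"
    pairs = [(1.0, [1 / math.sqrt(n)] * n)]
    scale = math.sqrt(2 / n)
    for j in range(1, (n - 1) // 2 + 1):
        lam = math.cos(2 * math.pi * j / n)
        pairs.append((lam, [scale * math.cos(2 * math.pi * j * v / n) for v in range(n)]))
        pairs.append((lam, [scale * math.sin(2 * math.pi * j * v / n) for v in range(n)]))
    return pairs

def mean_squared_feature_norm(pairs, m, alpha):
    """Average over vertices of the squared norm of (|lam_1|^alpha f_1(v), ..., |lam_m|^alpha f_m(v))."""
    top = sorted(pairs, key=lambda pair: -pair[0])[:m]
    n = len(pairs[0][1])
    total = 0.0
    for v in range(n):
        total += sum((abs(lam) ** alpha * f[v]) ** 2 for lam, f in top)
    return total / n
```

Because each eigenvector has unit norm, the average collapses to $\frac{1}{n}\sum_{i \le m} |\lambda_i|^{2\alpha}$, which is at most $m/n \le 1$ regardless of how localized individual eigenvector entries are.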



4 Gap-tolerant classifiers in Banach spaces 

In this section, we state and prove Theorem 7, our main result regarding an upper bound for the
annealed entropy of gap-tolerant classifiers in a Banach space. We start in Section 4.1 with some
technical preliminaries; then in Section 4.2 we bound the VC dimension of gap-tolerant classifiers
in Banach spaces over a ball of radius $R$; and finally in Section 4.3 we prove Theorem 7. We
include this result for completeness since it is of theoretical interest; since it follows using similar
methods to the analogous results for Hilbert spaces that we presented in Section 3; and since
this result is of potential practical interest in cases where modeling the relationship between data
elements as a dot product in a Hilbert space is too restrictive, e.g., when the data are extremely
sparse and heavy-tailed. For recent work in machine learning on Banach spaces, see [71 121 1 [20 l 128]. 

4.1 Technical preliminaries 

Recall the definition of a Banach space from Definition 2 above. We next state the following form
of the Chernoff bound, which we will use in the proof of Lemma 4 below.

Lemma 3 (Chernoff Bound) Let $X_1, \ldots, X_n$ be discrete independent random variables such
that $\mathbb{E}[X_i] = 0$ for all $i$ and $|X_i| \le 1$ for all $i$. Let $X = \sum_{i=1}^{n} X_i$ and let $\sigma^2$ be the variance
of $X$. Then

$$\mathbb{P}[|X| \ge \lambda\sigma] \le 2e^{-\lambda^2/4}$$

for any $0 \le \lambda \le 2\sigma$.
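For the case used later, where the $X_i$ are independent uniform random signs (so $\sigma = \sqrt{n}$), the tail probability can be computed exactly and compared with the bound $2e^{-\lambda^2/4}$. The following sketch does this with binomial coefficients; the function names are illustrative.

```python
import math

def rademacher_tail(n, t):
    """Exact P[|X| >= t] for X a sum of n independent uniform +/-1 signs."""
    favourable = 0
    for plus_ones in range(n + 1):
        # With `plus_ones` signs equal to +1, the sum equals 2 * plus_ones - n.
        if abs(2 * plus_ones - n) >= t:
            favourable += math.comb(n, plus_ones)
    return favourable / 2 ** n

def chernoff_bound(lam):
    """Right-hand side 2 * exp(-lam^2 / 4) of the bound in Lemma 3."""
    return 2 * math.exp(-lam ** 2 / 4)
```

Evaluating both functions over a grid of $\lambda \in [0, 2\sqrt{n}]$ shows the exact tail sitting below the bound throughout the admissible range.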

4.2 Bounds on the VC dimension of gap-tolerant classifiers in Banach spaces 

The idea underlying our new proof of Lemma 2 (of Section 3.2, and that provides an upper bound
on the VC dimension of a gap-tolerant classifier in Hilbert spaces) generalizes to the case when
the gap is measured in more general Banach spaces. We state the following lemma for a
Banach space of type $p$ with type constant $T$. Recall, e.g., that $\ell_p$ for $p \ge 1$ is a Banach space of
type $\min(2, p)$ and type constant 1.

Lemma 4 (VC Dimension; Upper bound; Banach Space) In a Banach space of type $p$ and type constant $T$, the VC dimension of a gap-tolerant classifier whose margin is $\Delta$ over a ball of radius $R$ can be bounded above by $\left(\frac{3TR}{\Delta}\right)^{\frac{p}{p-1}} + 64$.

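Proof: Since a general Banach space does not possess an inner product, the proof of Lemma 2 needs to be modified here. To circumvent this difficulty, we use the inequality determining the Rademacher type of $B$. This, while permitting greater generality, provides weaker bounds than previously obtained in the Euclidean case. Note that if $\mu := \frac{1}{n}\sum_{i=1}^{n} x_i$, then by repeated application of the Triangle Inequality,

$$\|x_i - \mu\| \le \frac{1}{n}\sum_{j=1}^{n} \|x_i - x_j\| \le 2 \sup_j \|x_j\|.$$

This shows that if we start with $x_1, \ldots, x_n$ having norm $\le R$, then $\|x_i - \mu\| \le 2R$ for all $i$. The property of being shattered by gap-tolerant classifiers is translation invariant. Then, for $\emptyset \subsetneq S \subsetneq [n]$, it can be verified that

$$2\Delta \le \left\| \frac{\sum_{i \in S}(x_i - \mu)}{|S|} - \frac{\sum_{i \notin S}(x_i - \mu)}{n - |S|} \right\| = \frac{n}{2|S|(n - |S|)} \left\| \sum_{i \in S}(x_i - \mu) - \sum_{i \notin S}(x_i - \mu) \right\|. \qquad (6)$$

The Rademacher Inequality states that

$$\mathbb{E}_{\epsilon}\left\| \sum_{i=1}^{n} \epsilon_i x_i \right\|^p \le T^p \sum_{i=1}^{n} \|x_i\|^p. \qquad (7)$$

Using the version of Chernoff's bound in Lemma 3,

$$\mathbb{P}\left[\left|\sum_{i=1}^{n} \epsilon_i\right| \le \lambda\sqrt{n}\right] \ge 1 - 2e^{-\lambda^2/4}. \qquad (8)$$

We shall denote the above event by $E_\lambda$. Now, let $x_1, \ldots, x_n$ be $n$ points in $B$ with a norm less than or equal to $R$. Let $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ as before. Then

$$2^p T^p n R^p \ge 2^p T^p \sum_{i=1}^{n} \|x_i\|^p$$

$$\ge T^p \sum_{i=1}^{n} \|x_i - \mu\|^p$$

$$\ge \mathbb{E}_{\epsilon}\left\| \sum_{i=1}^{n} \epsilon_i (x_i - \mu) \right\|^p$$

$$\ge \mathbb{E}_{\epsilon}\left[\left\| \sum_{i=1}^{n} \epsilon_i (x_i - \mu) \right\|^p \,\Big|\, E_\lambda\right] \mathbb{P}[E_\lambda]$$

$$\ge (n - \lambda^2)^p\, \Delta^p \left(1 - 2e^{-\lambda^2/4}\right).$$

The last inequality follows from (6) and (8). We infer from the preceding sequence of inequalities that

$$n^{p-1} \le \frac{2^p T^p \left(\frac{R}{\Delta}\right)^p}{\left(1 - \frac{\lambda^2}{n}\right)^p \left(1 - 2e^{-\lambda^2/4}\right)}.$$

The above is true for any $\lambda \in (0, 2\sqrt{n})$, by the conditions in the Chernoff bound stated in Lemma 3. If $n > 64$, choosing $\lambda$ equal to 8 gives us $n^{p-1} \le 3^p T^p \left(\frac{R}{\Delta}\right)^p$. Therefore, it is always true that $n \le \left(\frac{3TR}{\Delta}\right)^{\frac{p}{p-1}} + 64$.

□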
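Two elementary facts used in the proof of Lemma 4 hold in any normed space and can be checked numerically: centering moves each point by at most $2R$, and the distance between the two centroids equals $\frac{n}{2|S|(n-|S|)}$ times the norm of the signed sum of centered points. The sketch below checks both in $\ell_p$ on a finite-dimensional example; the dimension, the value of $p$, and the helper names are assumptions made for illustration.

```python
def lp_norm(v, p):
    """The l_p norm of a finite vector."""
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

def centered(points):
    """Subtract the mean mu = (1/n) * sum_i x_i from every point."""
    n, d = len(points), len(points[0])
    mu = [sum(pt[j] for pt in points) / n for j in range(d)]
    return [[pt[j] - mu[j] for j in range(d)] for pt in points]

def centroid_gap(pts, S, p):
    """l_p distance between the centroid of the points indexed by S and that of the rest."""
    n, d = len(pts), len(pts[0])
    inside = [sum(pts[i][j] for i in S) / len(S) for j in range(d)]
    outside = [sum(pts[i][j] for i in range(n) if i not in S) / (n - len(S))
               for j in range(d)]
    return lp_norm([a - b for a, b in zip(inside, outside)], p)

def signed_sum_norm(pts, S, p):
    """l_p norm of sum_i eps_i * pts[i], with eps_i = +1 on S and -1 off S."""
    n, d = len(pts), len(pts[0])
    signed = [sum((1 if i in S else -1) * pts[i][j] for i in range(n)) for j in range(d)]
    return lp_norm(signed, p)
```

The identity relies only on the centered points summing to zero, which is why it survives the passage from Hilbert to Banach spaces while the inner-product expansion used for Lemma 2 does not.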

Finally, for completeness, we next state a lower bound for the VC dimension of gap-tolerant classifiers when the margin is measured in a norm that is associated with a Banach space of type $p \in (1, 2]$. Since we are interested only in a lower bound, we consider the special case of $\ell_p$. Note that this argument does not immediately generalize to Banach spaces of higher type because for $p > 2$, $\ell_p$ has type 2.

Lemma 5 (VC Dimension; Lower Bound; Banach Space) For each $p \in (1, 2]$, there exists a Banach space of type $p$ such that the VC dimension of gap-tolerant classifiers with gap $\Delta$ over a ball of radius $R$ is greater than or equal to

$$\left(\frac{R}{\Delta}\right)^{\frac{p}{p-1}}.$$

Further, this bound is achieved when the space is $\ell_p$.

Proof: We shall show that the first $n$ unit norm basis vectors in the canonical basis can be shattered using gap-tolerant classifiers, where $\Delta = n^{\frac{1-p}{p}}$. Therefore in this case, the VC dimension is $\ge \left(\frac{R}{\Delta}\right)^{\frac{p}{p-1}}$. Let $e_j$ be the $j$-th basis vector. In order to prove that the set $\{e_1, \ldots, e_n\}$ is shattered, due to symmetry under permutations, it suffices to prove that for each $k$, $\{e_1, \ldots, e_k\}$ can be separated from $\{e_{k+1}, \ldots, e_n\}$ using a gap-tolerant classifier. Points in $\ell_p$ are infinite sequences $(x_1, \ldots)$ of finite $\ell_p$ norm. Consider the hyperplane $H$ defined by $\sum_{i=1}^{k} x_i - \sum_{i=k+1}^{n} x_i = 0$. Clearly, it separates the sets in question. We may assume $e_j$ to be $e_1$, replacing if necessary $k$ by $n - k$. Let $x = \arg\inf_{y \in H} \|e_1 - y\|_p$. Clearly, all coordinates $x_{n+1}, \ldots$ of $x$ are 0. In order to get a lower bound on the $\ell_p$ distance, we use the power-mean inequality: If $p \ge 1$ and $x_1, \ldots, x_n \in \mathbb{R}$,

$$\left(\frac{\sum_{i=1}^{n} |x_i|^p}{n}\right)^{\frac{1}{p}} \ge \frac{\sum_{i=1}^{n} |x_i|}{n}.$$

This implies that

$$\|e_1 - x\|_p \ge n^{\frac{1-p}{p}}\, \|e_1 - x\|_1 \ge n^{\frac{1-p}{p}} = \Delta,$$

where the second inequality holds because $x \in H$ forces the signed sum of the first $n$ coordinates of $e_1 - x$ to equal 1, so that $\|e_1 - x\|_1 \ge 1$.
For $p > 2$, the type of $\ell_p$ is 2 [13]. Since $\frac{p}{p-1}$ is a decreasing function of $p$ in this regime, we do
not recover any useful bounds.
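The distance estimate in the proof of Lemma 5 can be probed numerically in finite dimension. The sketch below samples points on the hyperplane $H = \{y : \sum_{i \le k} y_i - \sum_{i > k} y_i = 0\}$ in $\mathbb{R}^n$, measures their $\ell_p$ distance to $e_1$, and compares it with $n^{(1-p)/p}$; it also constructs the point $e_1 - a/n$, where $a$ is the $\pm 1$ normal vector of $H$, at which the bound is attained. The Euclidean projection is used only as a convenient way to generate points of $H$, and all names are illustrative.

```python
def lp_norm(v, p):
    """The l_p norm of a finite vector."""
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

def distance_lower_bound(n, p):
    """The margin n^((1 - p) / p) from the proof of Lemma 5."""
    return n ** ((1.0 - p) / p)

def normal_vector(n, k):
    """Normal of H: +1 on the first k coordinates and -1 on the remaining n - k."""
    return [1.0] * k + [-1.0] * (n - k)

def project_onto_H(y, k):
    """Euclidean projection of y onto H (used only to sample points of H)."""
    a = normal_vector(len(y), k)
    t = sum(ai * yi for ai, yi in zip(a, y)) / len(y)
    return [yi - t * ai for ai, yi in zip(a, y)]

def closest_point(n, k):
    """The point e_1 - a/n, which lies on H at l_p distance n^((1 - p) / p) from e_1."""
    a = normal_vector(n, k)
    e1 = [1.0] + [0.0] * (n - 1)
    return [e - ai / n for e, ai in zip(e1, a)]
```

Since the minimizing displacement spreads mass $1/n$ evenly over all $n$ coordinates, the margin shrinks with $n$ at the rate $n^{(1-p)/p}$, which is the source of the exponent $p/(p-1)$ in the lemma.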
4.3 Bound on the annealed entropy of gap-tolerant classifiers in Banach spaces

The following theorem is our main result regarding an upper bound for the annealed entropy of gap-tolerant classifiers in Banach spaces. Note that the $\ell_2$ bound provided by this theorem is slightly weaker than that provided by Theorem 6. Note also that it may seem counter-intuitive that in the case of $\ell_2$ (i.e., when we set $\gamma = 2$), the dependence on $\Delta$ is $\Delta^{-1}$, which is weaker than in the VC bound, where it is $\Delta^{-2}$. The explanation is that the bound on annealed entropy here depends on the number of samples $\ell$, while the VC dimension does not. Therefore, the weaker dependence on $\Delta$ is compensated for by a term that in fact tends to $\infty$ as the number of samples $\ell \to \infty$.

Theorem 7 (Annealed entropy; Upper bound; Banach Space) Let $\mathcal{P}$ be a probability measure on a Banach space $B$ of type $p$ and type constant $T$. Let $\gamma, \Delta > 0$, and let $\eta = \frac{p}{p + (p-1)\gamma}$. If $\mathbb{E}_{\mathcal{P}}\|x\|^{\gamma} = r^{\gamma} < \infty$, then the annealed entropy of gap-tolerant classifiers in $B$, where the gap is $\Delta$, is

$$H^{\Lambda}_{ann}(\ell) \le \left(\eta^{-\eta}(1-\eta)^{-(1-\eta)} \left(\frac{\ell}{\ln(\ell+1)} \left(\frac{3Tr}{\Delta}\right)^{\gamma}\right)^{\eta} + 64\right) \ln(\ell + 1).$$

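Proof: The proof of this theorem parallels that of Theorem 6, except that here we use Lemma 4 instead of Lemma 2. We include the full proof for completeness. Let $\ell$ independent, identically distributed (i.i.d.) samples $z_1, \ldots, z_\ell$ be chosen from $\mathcal{P}$. We partition them into two classes:

$$X = \{x_1, \ldots, x_{\ell-k}\} := \{z_i \mid \|z_i\| > R\},$$

and

$$Y = \{y_1, \ldots, y_k\} := \{z_i \mid \|z_i\| \le R\}.$$

Our objective is to bound from above the annealed entropy $H^{\Lambda}_{ann}(\ell) = \ln \mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)]$. By Lemma 1, $N^{\Lambda}$ is sub-multiplicative. Therefore,

$$N^{\Lambda}(z_1, \ldots, z_\ell) \le N^{\Lambda}(x_1, \ldots, x_{\ell-k})\, N^{\Lambda}(y_1, \ldots, y_k).$$

Taking an expectation over $\ell$ i.i.d. samples from $\mathcal{P}$,

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \mathbb{E}[N^{\Lambda}(x_1, \ldots, x_{\ell-k})\, N^{\Lambda}(y_1, \ldots, y_k)].$$

Now applying Lemma 4, we see that

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \mathbb{E}\left[N^{\Lambda}(x_1, \ldots, x_{\ell-k})\,(k+1)^{(3TR/\Delta)^{\frac{p}{p-1}} + 64}\right].$$

Since $k \le \ell$, we may replace $(k+1)$ by $(\ell+1)$ and move this factor outside the expectation:

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \mathbb{E}[N^{\Lambda}(x_1, \ldots, x_{\ell-k})]\,(\ell+1)^{(3TR/\Delta)^{\frac{p}{p-1}} + 64}.$$

Note that $N^{\Lambda}(x_1, \ldots, x_{\ell-k})$ is always bounded above by $2^{\ell-k}$ and that the random variables $I[\|z_i\| > R]$ are i.i.d. Let $p_R = \mathbb{P}[\|z_i\| > R]$ (written with a subscript to distinguish it from the type $p$), and note that $\ell - k$ is the sum of $\ell$ independent Bernoulli variables. Moreover, by Markov's inequality,

$$\mathbb{P}[\|z_i\| > R] \le \frac{\mathbb{E}[\|z_i\|^{\gamma}]}{R^{\gamma}},$$

and therefore $p_R \le (r/R)^{\gamma}$. In addition,

$$\mathbb{E}[N^{\Lambda}(x_1, \ldots, x_{\ell-k})] \le \mathbb{E}[2^{\ell-k}].$$

Let $I[\cdot]$ denote an indicator variable. $\mathbb{E}[2^{\ell-k}]$ can be written as

$$\prod_{i=1}^{\ell} \mathbb{E}\left[2^{I[\|z_i\| > R]}\right] = (1 + p_R)^{\ell} \le e^{p_R \ell}.$$

Putting everything together, we see that

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \exp\left(\ell\left(\frac{r}{R}\right)^{\gamma} + \ln(\ell+1)\left(64 + \left(\frac{3TR}{\Delta}\right)^{\frac{p}{p-1}}\right)\right). \qquad (9)$$

By setting $\eta := \frac{p}{p + (p-1)\gamma}$, and adjusting $R$ so that

$$\ell\left(\frac{r}{R}\right)^{\gamma} \eta^{-1} = (1-\eta)^{-1} \ln(\ell+1) \left(\frac{3TR}{\Delta}\right)^{\frac{p}{p-1}},$$

we see that

$$\ell\left(\frac{r}{R}\right)^{\gamma} + \ln(\ell+1)\left(\frac{3TR}{\Delta}\right)^{\frac{p}{p-1}} = \eta^{-\eta}(1-\eta)^{-(1-\eta)}\, \ell^{\eta} \left(\ln(\ell+1)\right)^{1-\eta} \left(\frac{3Tr}{\Delta}\right)^{\gamma\eta}.$$

Thus, it follows that

$$H^{\Lambda}_{ann}(\ell) \le \left(\eta^{-\eta}(1-\eta)^{-(1-\eta)} \left(\frac{\ell}{\ln(\ell+1)} \left(\frac{3Tr}{\Delta}\right)^{\gamma}\right)^{\eta} + 64\right) \ln(\ell + 1).$$

□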
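The balancing of the two $R$-dependent terms in the exponent of (9) can be checked numerically. The sketch below assumes the exponent has the form $\ell (r/R)^{\gamma} + \ln(\ell+1)\left((3TR/\Delta)^{p/(p-1)} + 64\right)$ and takes $\eta = p/(p + (p-1)\gamma)$, which is the value that makes $R$ cancel from the balanced expression; both are our reading of the argument rather than formulas quoted from it, and all names are illustrative. It requires $p > 1$.

```python
import math

def eta(p, gamma):
    """Balancing weight eta = p / (p + (p - 1) * gamma); assumes p > 1."""
    return p / (p + (p - 1) * gamma)

def balanced_radius(ell, r, delta, T, p, gamma):
    """R solving ell*(r/R)^gamma / eta = ln(ell+1) * (3TR/delta)^q / (1 - eta), with q = p/(p-1)."""
    q = p / (p - 1)
    h = eta(p, gamma)
    log_term = math.log(ell + 1)
    numerator = ell * r ** gamma * (1 - h)
    denominator = h * log_term * (3 * T / delta) ** q
    return (numerator / denominator) ** (1 / (q + gamma))

def exponent(ell, r, delta, T, p, gamma, R):
    """Assumed exponent of (9) evaluated at truncation radius R."""
    q = p / (p - 1)
    return ell * (r / R) ** gamma + math.log(ell + 1) * ((3 * T * R / delta) ** q + 64)

def closed_form(ell, r, delta, T, p, gamma):
    """Closed-form value of the exponent at the balanced radius."""
    h = eta(p, gamma)
    log_term = math.log(ell + 1)
    lead = h ** (-h) * (1 - h) ** (-(1 - h))
    return (lead * (ell / log_term * (3 * T * r / delta) ** gamma) ** h + 64) * log_term
```

For $p = \gamma = 2$ the weight is $\eta = 1/2$, and the leading term scales as $\ell^{1/2} (\ln(\ell+1))^{1/2}\, r/\Delta$, matching the $\Delta^{-1}$ dependence discussed before the theorem.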



5 Discussion 

In recent years, there has been a considerable amount of somewhat-related technical work in a 
variety of settings in machine learning. Thus, in this section we will briefly describe some of the 
more technical components of our results in light of the existing related literature. 

• Techniques based on the use of Rademacher inequalities allow one to obtain bounds without 
any assumption on the input distribution as long as the feature maps are uniformly bounded. 
See, e.g., [lOl [m [H [13] . Viewed from this perspective, our results are interesting because the 
uniform boundedness assumption is not satisfied in either of the two settings we consider, 
although those settings are ubiquitous in applications. In the case of heavy-tailed data, 
the uniform boundedness assumption is not satisfied due to the slow decay of the tail and 
the large variability of the associated features. In the case of spectral learning, the uniform
boundedness assumption is not satisfied since for arbitrary graphs one can have localization
and thus large variability in the entries of the eigenvectors defining the feature maps. In
both cases, existing techniques based on Rademacher inequalities or VC dimensions fail to
give interesting results, but we show that dimension-independent bounds can be achieved
by bounding the annealed entropy.
• A great deal of work has focused on using diffusion-based and spectral-based methods for
nonlinear dimensionality reduction and for learning a nonlinear manifold from which the
data are assumed to be drawn [2^ . These results are very different from the type of learning 
bounds we consider here. For instance, most of those learning results concern convergence
to a hypothesized manifold Laplacian rather than the learning process itself, which is what we
consider here.

• Work by Bousquet and Elisseeff [3] has focused on establishing generalization bounds based 
on stability. It is important to note that their results assume a given algorithm and show 
how the generalization error changes when the data are changed, so they get generalization 
results for a given algorithm. Our results make no such assumptions about working with a 
given algorithm. 

• Gurvits [10] has used Rademacher complexities to prove upper bounds for the sample complexity
of learning bounded linear functionals on $\ell_p$ balls. The results in that paper can
be used to derive an upper bound on the VC dimension of gap-tolerant classifiers with
margin $\Delta$ in a ball of radius $R$ in a Banach space of Rademacher type $p \in (1, 2]$. Constants
were not computed in that paper; therefore, our results do not follow from it. Moreover, our paper
contains results on distribution-specific bounds, which were not considered there. Finally,
our paper considers the application of these tools to the practically important settings of
spectral kernels and heavy-tailed data, which were not considered there.

6 Conclusion 

We have considered two simple machine learning problems motivated by recent work in large-scale
data analysis, and we have shown that although traditional distribution-independent methods
based on the VC dimension fail to yield nontrivial sample complexity bounds, we can
use distribution-dependent methods to obtain dimension-independent learning bounds. In both
cases, we take advantage of the fact that, although there may be individual data points that are
"outlying," in aggregate their effect is not too large. Due to the increased popularity of vector
space-based methods (as opposed to more purely combinatorial methods) in machine learning
in recent years, coupled with the continued generation of noisy and poorly-structured data, the
tools we have introduced are likely to be useful more generally for understanding the effect of noise
and noisy data on popular machine learning tasks.

A Exact learning with heavy-tailed data 

In this appendix section, we state and prove a second result for dimension-independent learning
from data in which the feature map exhibits a heavy-tailed decay. The heavy-tailed model we
consider here is different from that considered in Theorem 1, and thus we are able to prove bounds
for exact (as opposed to maximum margin) learning. Nevertheless, the techniques are similar,
and thus we include this result in this paper for completeness.

Consider the following toy model for classifying web pages using keywords. One approach 
to this problem could be to associate with each web page the indicator vector corresponding 
to all keywords that it contains. The dimension of this feature space is the number of possible 
keywords, which is typically very large, and empirical evidence indicates that the frequency of 
words decays in a heavy-tailed manner. Thus the VC dimension of the feature space is very 
large, and in a distribution-free setting it is not possible to classify data in such a feature space 
unless the number of samples is of the order of the VC dimension. More generally, one might 
be interested in a bipartite graph, e.g., an "advertiser-keyword" or "author-to-paper" graph, in
which the nodes are the stated entities and the edges represent some sort of "interaction" between 
the entities, in which case similar issues arise. 

Here, we show that if the probability that the $i$-th keyword in the above toy example is present
is heavy-tailed as a function of $i$, then the sample complexity of the binary classification problem is
dimension-independent. More precisely, the following theorem provides a dimension-independent
(i.e., independent of the size $n$ of the graph and the dimension of the feature space) upper bound
on the number of samples needed to learn by ERM, with a given accuracy and confidence, a
linear hyperplane that classifies heavy-tailed data into positive and negative labels, under the
assumption that the probability of the $i$-th coordinate of a random data point being non-zero is
less than $C i^{-\alpha}$ for some $C > 0$, $\alpha > 1$. The proof of this result proceeds by providing
a dimension-independent upper bound on the annealed entropy of the class of linear classifiers in
$\mathbb{R}^n$, and then appealing to Theorem 5 relating the annealed entropy to the generalization error.

Remark: Note that although the generalization bound provided by the following theorem
seems to be pessimistic in $\alpha$, the dependence on $\alpha$ is tight, at least as $\alpha$ tends to 1. Clearly,
when $\alpha = 1$, the expected number of 1's in a random sample becomes asymptotically equal to
$\log n$, where $n$ is the dimension, in which case we do not expect a sample complexity that is
dimension-independent.
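The contrast between $\alpha = 1$ and $\alpha > 1$ in the remark can be seen directly from the expected number of non-zero coordinates, which is at most $\sum_{i \le n} \min(1, C i^{-\alpha})$. The sketch below evaluates this sum; the function name and the choice $C = 1$ in the checks are illustrative.

```python
import math

def expected_nonzeros(n, C, alpha):
    """Upper bound sum_{i <= n} min(1, C * i^(-alpha)) on the expected number of non-zero coordinates."""
    return sum(min(1.0, C * i ** (-alpha)) for i in range(1, n + 1))
```

For $\alpha = 1$ the sum is the harmonic number and keeps growing like $\ln n$, whereas for $\alpha = 2$ it stays below $\pi^2/6$ however large $n$ becomes, which is the regime in which a dimension-independent bound can be expected.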

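Theorem 8 (Bounds for Heavy-Tailed Data) Let $\mathcal{P}$ be a probability distribution in $\mathbb{R}^n$. Suppose $\mathcal{P}[x_i \ne 0] \le C i^{-\alpha}$ for some absolute constant $C > 0$, with $\alpha > 1$. Then, the annealed entropy of ordinary linear hyperplane classifiers is

$$H^{\Lambda}_{ann}(\ell) \le \ell^{1/\alpha}\left(\frac{\alpha-1}{\alpha}\ln \ell + 1\right) + \frac{C}{\alpha-1}\,\ell^{1/\alpha}. \qquad (10)$$

Consequently, the minimum number of random samples $\ell = \ell(\epsilon, \delta)$ needed to learn, by ERM, a
classifier whose risk differs from the minimum risk $R(\alpha)$ by $\le \epsilon\sqrt{R(\alpha)}$ with probability $\ge 1 - \delta$
is less than or equal to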




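Proof: Let the event that a sample $z_i = (z_{i1}, z_{i2}, \ldots)$ has a non-zero coordinate $z_{ik'}$ for some $k' > \ell^{1/\alpha}$ be denoted $E_i$. The probability of this event can be bounded as follows. If $\alpha \ne 1$ and $k = \ell^{1/\alpha}$, then

$$\mathbb{P}[E_i] = \mathbb{P}[\exists k' > \ell^{1/\alpha} \text{ such that } z_{ik'} \ne 0] \le \sum_{j=k+1}^{\infty} C j^{-\alpha} \le \frac{C k^{1-\alpha}}{\alpha - 1}.$$

We partition the $z_i$ into two classes:

$$X = \{x_1, \ldots, x_{\ell-m}\} := \{z_i \text{ such that } E_i \text{ holds}\}$$

and

$$Y = \{y_1, \ldots, y_m\} := \{z_i \text{ such that } E_i \text{ does not hold}\}.$$

$N^{\Lambda}$ is sub-multiplicative by Lemma 1. Taking an expectation over $\ell$ i.i.d. samples from $\mathcal{P}$,

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \mathbb{E}[N^{\Lambda}(x_1, \ldots, x_{\ell-m})\, N^{\Lambda}(y_1, \ldots, y_m)].$$

The dimension of the span of $\{y_1, \ldots, y_m\}$ is at most $k$, and by a result from VC theory ([27], page 159) we have

$$N^{\Lambda}(y_1, \ldots, y_m) \le \exp\left(k\left(\ln\frac{m}{k} + 1\right)\right).$$

Then,

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \mathbb{E}\left[N^{\Lambda}(x_1, \ldots, x_{\ell-m}) \left(\frac{em}{k}\right)^{k}\right].$$

Since $m \le \ell$, moving $(e\ell/k)^k$ outside this expression,

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \mathbb{E}[N^{\Lambda}(x_1, \ldots, x_{\ell-m})] \left(\frac{e\ell}{k}\right)^{k}.$$

Note that $N^{\Lambda}(x_1, \ldots, x_{\ell-m})$ is always bounded above by $2^{\ell-m}$ and that the events $E_1, E_2, \ldots$ are independent and identically distributed. Let $p = \mathbb{P}[E_i]$, and note that $\ell - m$ is the sum of $\ell$ independent $p$-Bernoulli variables. In addition,

$$\mathbb{E}[N^{\Lambda}(x_1, \ldots, x_{\ell-m})] \le \mathbb{E}[2^{\ell-m}],$$

and $\mathbb{E}[2^{\ell-m}]$ can be written as

$$\prod_{i=1}^{\ell} \mathbb{E}\left[2^{I[E_i]}\right] = (1+p)^{\ell} \qquad (11)$$

$$\le e^{p\ell} \qquad (12)$$

$$\le \exp\left(\frac{C\,\ell^{1/\alpha}}{\alpha - 1}\right). \qquad (13)$$

Putting everything together, we see that

$$\mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \left(\frac{e\ell}{k}\right)^{k} e^{p\ell}.$$

Since $k = \ell^{1/\alpha}$, we see that

$$H^{\Lambda}_{ann}(\ell) = \ln \mathbb{E}[N^{\Lambda}(z_1, \ldots, z_\ell)] \le \ell^{1/\alpha}\left(\frac{\alpha-1}{\alpha}\ln \ell + 1\right) + \frac{C}{\alpha-1}\,\ell^{1/\alpha}. \qquad (14)$$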

In order to obtain sample complexity bounds, we need to apply Theorem 5 and substitute the
above expression for annealed entropy. For the probability that the error of ERM exceeds $\epsilon\sqrt{R(\alpha)}$
to be less than $\delta$ (where $\alpha$ is the optimal classifier), it is sufficient that $\ell$ satisfy

£^ ln(2£) - eV4 \ i < 6. 

a-l J 

For this to be true, it is enough that 

4 a — 1 



A calculation shows that 



2«(^(i^+ln|))"'^nU(f|+lnf 



a — 1 

is a value of $\ell$ that satisfies the previous expression.
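Two quantitative steps in the proof of Theorem 8 lend themselves to a quick numerical check: the comparison of the tail sum $\sum_{j > k} C j^{-\alpha}$ with the integral bound $C k^{1-\alpha}/(\alpha - 1)$, and the sublinear growth in $\ell$ of an entropy bound of the form $k(\ln(\ell/k) + 1) + C k/(\alpha - 1)$ with $k = \ell^{1/\alpha}$. The sketch below assumes that form of the bound (our reading of the argument); the truncation length and all names are illustrative.

```python
import math

def tail_mass(k, C, alpha, terms=100000):
    """Truncated tail sum of C * j^(-alpha) over j = k+1 .. k+terms (a lower estimate of the full tail)."""
    return sum(C * j ** (-alpha) for j in range(k + 1, k + 1 + terms))

def integral_tail_bound(k, C, alpha):
    """Integral comparison bound C * k^(1 - alpha) / (alpha - 1), valid for alpha > 1."""
    return C * k ** (1 - alpha) / (alpha - 1)

def entropy_bound(ell, C, alpha):
    """Assumed bound k * (ln(ell / k) + 1) + C * k / (alpha - 1) with k = ell^(1 / alpha)."""
    k = ell ** (1 / alpha)
    return k * (math.log(ell / k) + 1) + C * k / (alpha - 1)
```

Dividing `entropy_bound(ell, C, alpha)` by `ell` gives a ratio that tends to zero as $\ell$ grows, which is the property that the generalization argument requires of the annealed entropy.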